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Abstract: This paper aims to focus, to bring the existing misperception of the letters 
‘p’ and ‘b’ into full view, so that readers understand that speakers are not
deficient in perceiving the sounds /p/ and /b/ when they listen to them, just as
with other phonemic sounds. It further uses spectrographic visualization of
these sounds to give the discussion a scientific footing, reviews the theories
involved, and brings out the linguistic phenomena surrounding the pronunciation
of the phonemes /p/ and /b/.
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INTRODUCTION 


According to Pamela Rogerson, difficulties can arise when English phonemes do
not exist in the L1. For instance, in conversation with native Arabic speakers
it is quite common to hear /b/ in place of /p/. Arab students have trouble
pronouncing the English sound /p/ because it does not exist in Arabic, so they
substitute the closest-sounding letter, “ba” (ب), which is regularly transcribed
as English /b/. Ann Baker and Sharon Goldstein describe it as a common problem
that speakers of Arabic may confuse /p/ and /b/, often replacing /p/ with /b/,
but sometimes doing the reverse and replacing /b/ with /p/.


For example, you can hear:


“Boethri” for “Poetry”
“Banda” for “Panda”
“barty” for “party”
“combuter” for “computer”


This practice is common across the Middle East, and the same pronunciation can
be heard in North Africa, for example in Sudan and even Egypt. The examples in
this paper are taken from the Jizan region of Saudi Arabia. I believe many
teachers have trained their students to practice the sound, yet with only
slight results. Accordingly, as a teacher with an interest in linguistics, I
chose this topic to contribute to current knowledge. The paper aims to explain,
from a linguistic point of view, the reasons behind the difficulty of learning
the difference between these two sounds, and to suggest possible ways of
teaching the contrast. It also brings in the McGurk effect and a spectrographic
study. In a McGurk-type presentation, listeners receive "discrepant"
audio-visual stimuli, for example the audible syllable /ba/ presented
simultaneously with the visual syllable /pa/, which does not produce the most
frequently reported response in the Arab region. The present study aims to
clarify the perceptual contrast by offering a possible explanation for this
difference, namely a significantly higher prevalence of McGurk-type integration
for the phonemes /p/ and /b/. The findings are discussed in terms of their
implications for the development of aural restoration programs for ESL learners
of Arabic origin. In short, this applied paper covers the perception of the two
phonemes, ways of acquiring them, a phonetic analysis using basic speech-science
spectra, and the hypotheses associated with this concept.


Motive behind the mispronunciation


Among the Arabic letters, the one closest to the English sound /p/ is “ba” (ب).
So /p/ is usually rendered as /b/ in written Arabic. Arabic “ba” (ب) is similar
to English /p/ except that it is “voiced”, like /b/ in English, whereas /p/ is
“voiceless”.
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Linguistic Context 


To understand the root cause of this difficulty, it is necessary to look into
the learner's native language, or L1. There are 28 letters in the Arabic
alphabet, and none of them represents the sound of the English letter /p/. The
precise articulation of vowels, consonants and their combinations is essential
for the listener to understand the word being spoken. The oral organs above the
vocal folds, such as the lips, teeth, tongue, hard palate and soft palate,
together make up the vocal tract and are known as the articulators. Many
consonants form cognate pairs: both members are articulated at the same place
and in the same manner, and they differ only in being voiced or voiceless. For
example, linguistic terminology defines /p/ as a voiceless bilabial stop.
“Voiceless” means that the vocal cords do not vibrate while the sound is
pronounced; “bilabial” means that it is pronounced with the two lips, upper and
lower; and it is a “stop” because it blocks the airflow for a short while. The
sound /p/ may be aspirated [pʰ], unaspirated [p=], pre-glottalized or
unreleased. If the appropriate variant is not used, the word is still
understandable but sounds sloppy. The counterpart of /p/ is /b/, a voiced
bilabial stop. It is “bilabial” because it can only be articulated by closing
the two lips, and it is a “stop” because articulating it temporarily stops the
flow of air in the mouth, but it is a “voiced” sound because the vocal cords
vibrate while it is uttered.
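
The description above can be written down as a small feature table. The following is only a minimal sketch: the dictionary layout, the name FEATURES and the function contrast are illustrative assumptions, not part of the study. It shows that, on the three-way description used here, /p/ and /b/ differ in voicing alone.

```python
# Illustrative feature bundles for the two phonemes discussed in the text.
# The structure and the names are assumptions made for this sketch.
FEATURES = {
    "p": {"voicing": "voiceless", "place": "bilabial", "manner": "stop"},
    "b": {"voicing": "voiced", "place": "bilabial", "manner": "stop"},
}


def contrast(a, b):
    """Return the features on which phonemes a and b differ."""
    fa, fb = FEATURES[a], FEATURES[b]
    return {name: (fa[name], fb[name]) for name in fa if fa[name] != fb[name]}


# Only the voicing feature separates the pair.
print(contrast("p", "b"))
```

For a learner whose L1 has only the voiced member of the pair, this single differing feature is the one that has to be acquired.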


Production of the sound /b/:


Hold the air behind the closed lips, then let it flow out through the lips as
if it were escaping from the mouth.

E.g. Words: Bell, label, lamb

Phrases: Bulky banana, burn carbs

Sentence: It’s the best way to curb the fabrics.


Production of the sound /p/:


Hold the air behind the closed lips, then release it through the lips with very
little effort.

E.g. Words: Pulp, captivate, jump

Phrases: Usurp the wrapper, prizing pomp

Sentence: The push pull express is ready for public transport.


Descriptions

When a “stop” is produced, the airflow ceases because the vocal tract is
blocked. Examples of stops are /p/ and /b/, /t/ and /d/, /k/ and /g/. Among
these, /p/, /t/ and /k/ are voiceless stops. Word-initial voiceless stops are
aspirated, as in push and pan, with a palpable puff of air upon release. When
words beginning with these voiceless stops, such as pack, talk and cake, are
spoken near a candle flame, the flame flickers noticeably. The difference
between the voiceless and the voiced stop phonemes is not just a matter of
whether articulatory voicing is present. It also involves the onset of voicing,
the presence of aspiration (the burst of airflow that follows the closure) and
the duration of the closure.


At this point, I would like to emphasize four features of /p/ that need to be
explained clearly to the students. The four features are:

• /p/ is fortis: a fortis is a consonant, particularly a voiceless consonant,
that is strongly articulated, more so than another consonant articulated in
the same place.

• /p/ is aspirate: to aspirate is to pronounce a sound with an exhalation of
breath.

• /p/ is voiceless: unvoiced sounds occur when there is no vocal cord
vibration.

• /p/ is fricative: a fricative is a consonant made by the friction of breath
in a narrow opening, producing a turbulent airflow.


Picture-1: The four Properties of the phoneme /P/ 


It is true, to some extent, that learners will be able to distinguish /p/ from
/b/ when they are exposed to these two sounds over and over. If learners are
exposed to the correct form of the language, they will adopt it. Because
exposure to English is limited, this process may take a little longer for ESL
learners.

Repeating the sounds /p/ and /b/ alternately, using words that contain them,
trains the articulation, and labial exercises are then drilled to produce the
same. So the best way of giving practice is the use of minimal pairs.

Pie - Bye
Pet - Bet
Pride - Bride
Nipping - Nibbing
Mopping - Mobbing
Cop - Cob
Rope - Robe
Back - Pack
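
A minimal pair can also be checked mechanically once words are written as phoneme sequences. The sketch below is illustrative only: the function name is_pb_minimal_pair and the rough transcriptions are assumptions made for this example, since spelling alone (Pie vs. Bye) does not show the contrast.

```python
def is_pb_minimal_pair(a, b):
    """True if phoneme lists a and b differ in exactly one slot, by /p/ vs /b/."""
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and set(diffs[0]) == {"p", "b"}


# Rough, hypothetical transcriptions used only for illustration.
pairs = [
    (["p", "aɪ"], ["b", "aɪ"]),            # pie - bye
    (["p", "ɛ", "t"], ["b", "ɛ", "t"]),    # pet - bet
    (["r", "əʊ", "p"], ["r", "əʊ", "b"]),  # rope - robe
]

for first, second in pairs:
    print("".join(first), "".join(second), is_pb_minimal_pair(first, second))
```

A check of this kind can help a teacher filter a word list down to pairs that isolate the /p/ and /b/ contrast in initial, medial or final position.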


Aspiration 


Draw attention to the production of aspiration in syllable-initial /p/ by
making learners aware of the puff of air that accompanies the release of the
consonant. Use a piece of paper or the palm of the hand, held in front of the
mouth, so that the students can feel the puff of air produced for the voiceless
stops. Alternatively, get learners to put an /h/ before the vowel and then add
the voiceless stop before it.


For example: 


“p(h)ay” “c(h)ow” 


A method used in CELTA/DELTA teacher training programs is to hold up a piece of
tissue paper and have the learners produce /b/ a few times and /p/ a few times.
After that, ask the learners what they observed about the movement of the paper
when it was held near the mouth. They will observe that articulating /b/ does
not produce any burst of air from the mouth, and they will answer that the
paper moved every time they articulated /p/.

• The diaphragm is the key muscle that regulates breathing. Observing the
movements of the diaphragm around the belly also gives awareness of
aspiration.

• To produce the phoneme /p/, air is briefly blocked from leaving the vocal
tract by bringing the lips together. The sound is aspirated when the air is
released. The release of air for the phoneme /p/ is greater than that for the
phoneme /b/.


Speech Sound Spectra 


A spectrum is a plot produced by loading a recorded audio file of a phoneme
(here, a consonant) into phonetic software. A spectrogram is a
three-dimensional plot (time, frequency and amplitude) and can be drawn using
the Praat software. The consonant sounds /p/ and /b/ were recorded and their
spectra were drawn using phonetic software on a compatible system. Figures (i)
and (ii) below show the resulting spectra of the /p/ and /b/ phonemes, with
frequency on the ‘x’ axis and amplitude on the ‘y’ axis. Change in amplitude
corresponds to the change in frequency content.
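
To make the idea of a power-versus-frequency plot concrete, the following is a minimal sketch of how such a spectrum can be computed from a short frame of samples. It is not the procedure used by the phonetic software in this study. The function name power_spectrum_db, the 8000 Hz sample rate and the synthetic 1000 Hz test tone are all assumptions chosen for illustration, and a direct DFT sum is used only because the frame is short.

```python
import cmath
import math


def power_spectrum_db(samples, sample_rate):
    """Return (frequencies in Hz, power in dB) for the non-negative DFT bins."""
    n = len(samples)
    freqs, power_db = [], []
    for k in range(n // 2 + 1):
        # Direct DFT sum for bin k; acceptable for short illustrative frames.
        x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        freqs.append(k * sample_rate / n)
        # A small floor avoids log10(0) for empty bins.
        power_db.append(10 * math.log10(abs(x_k) ** 2 + 1e-12))
    return freqs, power_db


sample_rate = 8000  # assumed sampling rate in Hz
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(64)]
freqs, power_db = power_spectrum_db(tone, sample_rate)
peak_freq = freqs[power_db.index(max(power_db))]
print(peak_freq)
```

For a recorded /p/ or /b/ the samples would come from the audio file instead of the synthetic tone, and the peak locations would reflect the energy distribution of the consonant.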


Picture-2: Sound spectrum of /p/ and /b/ (panels labelled P and B)


The above diagram shows a graphic representation of the spectrum (power /
pressure / amplitude vs. frequency) for the simple sounds /p/ followed by /b/,
pronounced by a non-Arabic speaker. Power is measured in decibels (dB) and
frequency in hertz (Hz, vibrations per second). Rob Hagiwara's Monthly Mystery
Spectrogram Webzone explains that voicing is represented on a wide-band
spectrogram by vertical striations, especially in the lowest frequencies. Each
vertical ‘line’ represents a single pulse of the vocal folds, a single puff of
air moving through the glottis. We sometimes refer to a ‘voicing bar’, i.e. a
row of striated energy in the very low frequencies, corresponding to the energy
in the first and second harmonics. Non-voicing is basically silence, and
doesn't show up as anything in a spectrogram. Hence voiced sounds show a
striated voicing bar and voiceless sounds do not. Although only two sounds were
recorded, the spectrum contains many other frequencies with short peaks. These
come from traffic rumble or other sounds in the surrounding environment. Some
numerical models predict that the airflow through the glottis and the vocal
fold vibrations depend on the pressure differences across the glottis and
folds, and thus cause waves in the tract [1-3]. The frequencies at which the
spectrum reaches its maximum heights are called peaks.


These spectral peaks are also called formants. The cluster of peaks indicates
the length over which the sound was uttered. This is the signal energy over
frequency for the phonemes /p/ and /b/. The unvoiced stop /p/ appears as a
burst instant in a time vs. power display. It is hard to identify the burst as
the power rate increases. A clear burst waveform is observed in the spectrum of
unvoiced stops.


Picture-3: Representation of degree of variations in amplitudes of P and B 


The second diagram above shows a graphic representation of the spectrum (power
vs. frequency in Hz) for the sounds /p/ and /b/, with variations in the height
of the peaks. These show the difference in rise between /p/ and /b/, i.e. 0.3
and 0.2 respectively at the top, and almost the same on the other side. When
the speed of sound is greater, the resonances occur at higher frequencies. This
indicates that the surface membrane of the vocal folds also vibrates at the
same frequency; nevertheless, the implications occur at the same frequencies
[1-3]. The amplitude of the sound varies with time, and the amplitude of the
unvoiced /p/ sound is much lower. The energy of the sound provides a picture
that reflects these amplitude variations. The wavy horizontal line is the
spectral envelope in the frequency-amplitude plane, which describes one point
in time. The spectral envelope wraps around the amplitude spectrum by joining
its peaks. The vertical lines are produced by the overtones of the vocal fold
vibration. The spectral envelope is only minimally affected by fluctuations of
the fine structure in either voiced or unvoiced sounds. The spectral envelope
G(Ω) is represented by:


\[ G(\Omega) = \sum_{m=0}^{M} c(m)\,\cos(m\Omega) \]

where c(m) is the cepstrum for the minimum-phase system.


The cepstrum is defined as the power spectrum of the logarithm of the power
spectrum.
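
The cosine-series envelope above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the analysis pipeline used for the figures: it uses the common "real cepstrum" (inverse DFT of the natural log of the magnitude spectrum), assumes an even frame length, and folds the cepstrum into its minimum-phase form so that the cosine sum reproduces the log-magnitude spectrum. All function names are hypothetical, and the direct DFT is used only because the frame is short.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform of a short frame."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
            for k in range(n)]


def real_cepstrum(samples):
    """Inverse DFT of the log magnitude spectrum (real for a real input frame)."""
    n = len(samples)
    log_mag = [math.log(abs(v)) for v in dft(samples)]
    return [sum(l * cmath.exp(2j * math.pi * k * m / n)
                for k, l in enumerate(log_mag)).real / n
            for m in range(n)]


def min_phase_cepstrum(c):
    """Fold the even real cepstrum onto m = 0..N/2 (N assumed even)."""
    half = len(c) // 2
    return [c[0]] + [2 * c[m] for m in range(1, half)] + [c[half]]


def envelope(c_min, omega, order):
    """G(omega) = sum over m = 0..order of c(m) * cos(m * omega)."""
    return sum(c_min[m] * math.cos(m * omega) for m in range(order + 1))


# Synthetic decaying frame standing in for a recorded sound (an assumption).
frame = [0.9 ** i for i in range(32)]
c_min = min_phase_cepstrum(real_cepstrum(frame))

# A low order M keeps only the slowly varying part of the log spectrum.
smooth = [envelope(c_min, 2 * math.pi * k / 32, 4) for k in range(17)]
print([round(v, 3) for v in smooth[:4]])
```

With the full order M = N/2 the sum reproduces the log-magnitude spectrum exactly at the DFT frequencies; truncating M to a small value discards the fine structure and leaves the envelope.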
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As per many researchers, when the glottis 
opens for whispering, the resonance or formant peaks occur at higher
frequencies [4-8]. The average opening of the glottis depends on what fraction
of the time it is open (its “open quotient”) and how far it opens [9-11], which
in turn depends on the voice register and pitch. Nearly all the information in
speech lies in the range 200 Hz - 8 kHz, which is the intelligible range. The
acoustic characteristics of voiceless consonants lie approximately between 2
and 6 kHz, with an average of 4.8 kHz. “Short-time window analysis” is applied
at the position of two frames.


There is a new group of modellers in voice research. The acoustics of the vocal
tract are often modelled using a mathematical filter model [12]. The
frequencies of the poles of the filter model fall close to the formants. As a
consequence, voice researchers usually refer to the frequencies of the poles as
formants. A formant is each of several prominent bands of frequency that define
the phonetic quality of a vowel, and these can be seen clearly in speech sound
spectra. To some voice researchers, a formant is a peak in the spectral
envelope, which is a property of the sound of the voice. To others, a formant
is a resonance of the vocal tract, whereas to the new group of modellers a
formant refers to the mathematical filter model. The formant frequencies can be
calculated by linear prediction analysis of the poles.
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The phonetic analysis 
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Picture-4: Signal of Intensities of the sounds /P/ and /B/ below the spectrum 


The study of a spectrum, like sound itself, is multifaceted. The spectrogram is
drawn with time on the horizontal axis and frequency in Hz on the vertical
axis. Amplitude is shown by the degree of black shading in the above
spectrogram. This part of the spectrum also shows two dark areas. The darker
the shading, the stronger the energy in that band of frequency. The black
strips in the spectral envelope are the vibrations of the vocal folds. The
green line represents the intensity level of the spoken sound. The blue line in
the spectrogram indicates pitch. The intensity is measured in decibels. Voiced
and unvoiced sounds differ not only in pitch; the frequencies of their formants
differ as well.


Pitch parameter 


Picture-5: Picture showing the difference in pitch levels of the sounds /P/ and /B/ 


The above is the pitch contour, which gives a clear picture of the difference
in pitch between the phonemes /p/ and /b/. The pitch parameter extractor
consists of a voiced-unvoiced (V/UV) decision, based on the average value of
the spectral envelope components in the fundamental frequency region, and
Noll’s cepstral peak picking. Climent Nadeu, Jordi Pascual and Javier Hernando
reported voiced-to-unvoiced and unvoiced-to-voiced error rates of 0.7-1.5% and
1.5-2.5%, respectively. These error rates are considerably lower than those
observed using cepstral peak picking. Pitch analysis, the autocorrelation
function and the zero-crossing rate are the methods usually used to make
voiced-unvoiced decisions [13]. In the proposed system, speaker-dependent
classification of voiced and unvoiced phonemes was made using line spectrum
frequencies. Two classes of phonemes are constructed automatically with respect
to their articulation. According to Jacob Benesty, M. M. Sondhi and Yiteng
Huang [14], voiced sounds are periodic, so their spectrum has a line structure
consisting of a fundamental frequency and its harmonics; noise can therefore be
reduced by keeping only this part of the spectrum while suppressing the other
parts. The accuracy of this method is determined by how accurately the pitch is
estimated and how reliably voiced sounds can be distinguished from voiceless
sounds.
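
Of the voiced-unvoiced cues named above, the zero-crossing rate is the simplest to illustrate. The sketch below is only an illustration under assumptions: the function names, the 0.1 threshold, the 8000 Hz sample rate and the two synthetic frames (a low-frequency sinusoid standing in for a voiced sound and uniform noise standing in for a voiceless one) are not taken from the study.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)


def is_voiced_by_zcr(frame, threshold=0.1):
    """Low zero-crossing rate suggests voicing; the threshold is an assumption."""
    return zero_crossing_rate(frame) < threshold


sample_rate = 8000  # assumed sampling rate in Hz
voiced_like = [math.sin(2 * math.pi * 100 * i / sample_rate + 0.3)
               for i in range(480)]
rng = random.Random(0)  # fixed seed so the example is repeatable
noise_like = [rng.uniform(-1.0, 1.0) for _ in range(480)]

print(is_voiced_by_zcr(voiced_like), is_voiced_by_zcr(noise_like))
```

Voiced speech is dominated by low-frequency periodic energy and so crosses zero rarely, whereas noise-like voiceless sounds cross zero on roughly every other sample. In practice this cue is usually combined with others, as the text notes.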


The most conventional pitch detection method is the autocorrelation function
method. Pitch is the fundamental frequency of the vocal folds’ vibrations.
Voiced sounds, namely vowels, semivowels and nasals, are pseudo-periodic waves;
hence the fundamental pitch frequency is calculated from the point of the
autocorrelation function that has the maximum level. The correlation value
distinguishes voiced from voiceless sounds. Thus there is no pitch for
voiceless stops and fricatives such as /p/, /t/, /k/, /th/, /ch/, /f/ and /s/.
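
The autocorrelation method described above can be sketched as follows. This is a minimal illustration, not the extractor used to produce the pitch contour: the function name estimate_pitch, the 60-400 Hz search range, the 0.5 voicing threshold and the synthetic test frames are all assumptions. The correlation is normalized by the energy of both segments so that a perfectly periodic frame scores close to 1 at its period, and the frame is assumed to be longer than the largest lag searched.

```python
import math
import random


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0, threshold=0.5):
    """Return the pitch in Hz, or None if the frame looks unvoiced."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    window = len(frame) - max_lag  # number of samples compared at every lag
    ref = frame[:window]
    ref_energy = sum(v * v for v in ref)
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        shifted = frame[lag:lag + window]
        energy = ref_energy * sum(v * v for v in shifted)
        if energy == 0.0:
            continue
        corr = sum(a * b for a, b in zip(ref, shifted)) / math.sqrt(energy)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # A weak maximum means no clear periodicity, i.e. a voiceless frame.
    if best_lag is None or best_corr < threshold:
        return None
    return sample_rate / best_lag


sample_rate = 8000  # assumed sampling rate in Hz
voiced = [math.sin(2 * math.pi * 100 * i / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 200 * i / sample_rate)
          for i in range(480)]
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(480)]

print(estimate_pitch(voiced, sample_rate), estimate_pitch(noise, sample_rate))
```

The periodic frame yields a pitch at its fundamental, while the noise frame has no lag with a strong correlation and is reported as having no pitch, which matches the statement that voiceless stops such as /p/ carry no pitch.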


The Fact 


The primary evidence for speech comprehension comes from listening together
with observing the visual evidence of lip movements, simultaneously and
voluntarily. Listeners use the sound they hear and the accompanying visual cues
to understand and follow the speech, and also to fill in auditory information
that was not heard. What is more interesting is that people use visual cues to
process speech even when the auditory signal is perfect [15], and a powerful
multisensory illusion occurs with audiovisual speech. The combining of aural
and visual cues during speech perception is termed audio-visual integration.
The phenomenon known as "the McGurk effect" has been used to explore this
audio-visual integration. Alsius et al. [16] described audiovisual integration
impairment in terms of a reduced McGurk effect, and Green et al. [17] provided
a good example, i.e. dubbing a male voice onto a female talker and vice versa.
Although the auditory speech is recognized correctly on its own, it is heard as
another consonant after dubbing with different visual speech. The illusion has
been named the “McGurk effect” or “fusion effect”. Many researchers have
defined the McGurk effect exclusively as the fusion effect, because here
integration results in the perception of a third consonant, evidently merging
information from audition and vision [18-20]. Interestingly, it was later shown
that the McGurk effect is generated even when the acoustic and visual speech
inputs come from clearly different directions [21]. This illusion has been used
by MacDonald and McGurk [22] themselves, and also by many others [23, 24]. The
issue has been elaborated in the extensive work of Massaro and colleagues
[25-28]. It is vital because the identification accuracy of the unisensory
components is reflected in audiovisual speech perception. However, McGurk and
MacDonald [15] themselves already wrote that “Lip movements for [pa] are
frequently misread as [ba]”. Because of this perceptual illusion, the visual
input of the phoneme /p/ leads the Arabic listener to hear it as the familiar
sound /b/, since the lip movement is very similar for both phonemes.


CONCLUSION 


The foregoing discussion provides an analysis of the voiced /b/ and unvoiced
/p/ sounds. The contrast between the voiced phoneme /b/ and the unvoiced
phoneme /p/ is an important research tool showing that auditory and visual
information is merged into an integrated percept, and the strength of
audiovisual integration has been demonstrated through the McGurk effect. The
McGurk effect is a clear-cut change in auditory speech perception induced by
visual input, which results in hearing a single percept that is something other
than what was said. This type of complex articulation involves "...separate and
successive articulatory activities which together can be identified as a single
segment" [29].


Fortis stops /p, t, k/ are aspirated [pʰ, tʰ, kʰ] when they occur at the
beginning of a word, as in tap and tomato, or at the beginning of a stressed
syllable in the middle of a word, as in potato. They are unaspirated [p, t, k]
after /s/, as in stopper, spin and skate, and at the ends of syllables, as in
sat, cap and luck. Knowing this helps learners to pronounce the phoneme /p/
correctly, and it also serves as a self-assessment tool by which they can
measure their own pronunciation and accuracy. On the order of sound acquisition
by young children, Owens [30] concluded the following: among the consonant
sounds, children generally first acquire the nasals, followed by the stops,
glides and liquids, fricatives and affricates. Consonants, however, are more
complicated to produce. The ways of producing consonants, or manners of
articulation, are stops, nasals, fricatives, affricates, glides and liquids
[31]. It is true that the earliest sounds children produce are the simplest for
them to articulate with their vocal tract as it is at that stage. These sounds
are the vowels, nasals and stops [32]. Speech production mechanisms vary among
sounds according to their different manners of articulation. It is therefore
necessary to apply an appropriate method to each sound in order to extract the
essential features for syllable or phoneme recognition in continuous speech.
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